Potato Chips Retail Strategy and Analytics

Part 1 - Data preparation and customer analytics

The objective of this chips-segment analysis is to help strategize plans for the next half year. The analysis identifies customer purchasing behaviour in order to generate insights and provide commercial recommendations.

The workflow followed can be summarized as:

  • analyse customers by segment based on their lifestage (young/old/retired etc.) and loyalty level (premium/budget/mainstream)
  • clean up product names, segregate products into brands and pack sizes, and keep only the chips products
  • combine the two datasets to analyse metrics such as sales, pack size and brand
  • compare the most prominent customer segment against the other segments on brand and pack-size preference
  • analyse total sales per month by customer loyalty level

Import Libraries

(The Plotly library is used for graphs and charts.)

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import plotly.io as pio
pio.templates.default = "plotly_white"

import statsmodels.api as sm
from scipy import stats

Read Datasets

In [2]:
# to identify the encoding of the datasets (requires the third-party chardet package)

# import chardet
# for file in ["QVI_purchase_behaviour.csv", "QVI_transaction_data.xlsx"]:
#     with open(file, mode = "rb") as f:
#         print(file, chardet.detect(f.read(100000)))
In [3]:
# read the purchase behaviour dataset 
purchase_behaviour = pd.read_csv("QVI_purchase_behaviour.csv")

# read the transactions dataset
transaction_data = pd.read_excel("QVI_transaction_data.xlsx")

Explore Datasets

In [4]:
purchase_behaviour.info()

pd.options.display.float_format = "{:.3f}".format

print("\n")

purchase_behaviour.describe(include = "all")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72637 entries, 0 to 72636
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   LYLTY_CARD_NBR    72637 non-null  int64 
 1   LIFESTAGE         72637 non-null  object
 2   PREMIUM_CUSTOMER  72637 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.7+ MB


Out[4]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
count 72637.000 72637 72637
unique nan 7 3
top nan RETIREES Mainstream
freq nan 14805 29245
mean 136185.932 NaN NaN
std 89892.932 NaN NaN
min 1000.000 NaN NaN
25% 66202.000 NaN NaN
50% 134040.000 NaN NaN
75% 203375.000 NaN NaN
max 2373711.000 NaN NaN
In [5]:
# display a few rows from purchase behaviour

pd.concat([purchase_behaviour.head(), purchase_behaviour.tail()])
Out[5]:
LYLTY_CARD_NBR LIFESTAGE PREMIUM_CUSTOMER
0 1000 YOUNG SINGLES/COUPLES Premium
1 1002 YOUNG SINGLES/COUPLES Mainstream
2 1003 YOUNG FAMILIES Budget
3 1004 OLDER SINGLES/COUPLES Mainstream
4 1005 MIDAGE SINGLES/COUPLES Mainstream
72632 2370651 MIDAGE SINGLES/COUPLES Mainstream
72633 2370701 YOUNG FAMILIES Mainstream
72634 2370751 YOUNG FAMILIES Premium
72635 2370961 OLDER FAMILIES Budget
72636 2373711 YOUNG SINGLES/COUPLES Mainstream
In [6]:
transaction_data.info()

print("\n")

transaction_data.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264836 entries, 0 to 264835
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   DATE            264836 non-null  int64  
 1   STORE_NBR       264836 non-null  int64  
 2   LYLTY_CARD_NBR  264836 non-null  int64  
 3   TXN_ID          264836 non-null  int64  
 4   PROD_NBR        264836 non-null  int64  
 5   PROD_NAME       264836 non-null  object 
 6   PROD_QTY        264836 non-null  int64  
 7   TOT_SALES       264836 non-null  float64
dtypes: float64(1), int64(6), object(1)
memory usage: 16.2+ MB


Out[6]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_QTY TOT_SALES
count 264836.000 264836.000 264836.000 264836.000 264836.000 264836.000 264836.000
mean 43464.036 135.080 135549.476 135158.311 56.583 1.907 7.304
std 105.389 76.784 80579.978 78133.026 32.827 0.644 3.083
min 43282.000 1.000 1000.000 1.000 1.000 1.000 1.500
25% 43373.000 70.000 70021.000 67601.500 28.000 2.000 5.400
50% 43464.000 130.000 130357.500 135137.500 56.000 2.000 7.400
75% 43555.000 203.000 203094.250 202701.250 85.000 2.000 9.200
max 43646.000 272.000 2373711.000 2415841.000 114.000 200.000 650.000
In [7]:
# display a few rows from transactions
pd.concat([transaction_data.head(), transaction_data.tail()])
Out[7]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
0 43390 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.000
1 43599 1 1307 348 66 CCs Nacho Cheese 175g 3 6.300
2 43605 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.900
3 43329 2 2373 974 69 Smiths Chip Thinly S/Cream&Onion 175g 5 15.000
4 43330 2 2426 1038 108 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.800
264831 43533 272 272319 270088 89 Kettle Sweet Chilli And Sour Cream 175g 2 10.800
264832 43325 272 272358 270154 74 Tostitos Splash Of Lime 175g 1 4.400
264833 43410 272 272379 270187 51 Doritos Mexicana 170g 2 8.800
264834 43461 272 272379 270188 42 Doritos Corn Chip Mexican Jalapeno 150g 2 7.800
264835 43365 272 272380 270189 74 Tostitos Splash Of Lime 175g 2 8.800

Observations:

  • the DATE column in the transactions dataset is stored as an Excel serial number - it needs to be converted to datetime format
  • columns like Product Number and Store Number are numeric, but they represent nominal data
  • the common key connecting the two datasets is the customer's Loyalty Card Number
  • there are no nulls in either dataset

Convert Date column stored as serialized number in Excel

In [8]:
from datetime import datetime as dt

transaction_data["DATE"] = (transaction_data["DATE"].apply(
                                                     lambda date: dt.fromordinal(dt(1900, 1, 1).toordinal() + date - 2)))

transaction_data.head()
Out[8]:
DATE STORE_NBR LYLTY_CARD_NBR TXN_ID PROD_NBR PROD_NAME PROD_QTY TOT_SALES
0 2018-10-17 1 1000 1 5 Natural Chip Compny SeaSalt175g 2 6.000
1 2019-05-14 1 1307 348 66 CCs Nacho Cheese 175g 3 6.300
2 2019-05-20 1 1343 383 61 Smiths Crinkle Cut Chips Chicken 170g 2 2.900
3 2018-08-17 2 2373 974 69 Smiths Chip Thinly S/Cream&Onion 175g 5 15.000
4 2018-08-18 2 2426 1038 108 Kettle Tortilla ChpsHny&Jlpno Chili 150g 3 13.800
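As a cross-check, pandas can perform the same Excel-serial conversion directly via `to_datetime`'s `origin` parameter (1899-12-30 as origin accounts for Excel's fictitious 1900 leap day). A minimal sketch using the serials from the first rows shown above:

```python
import pandas as pd

# Excel serial dates from the first rows of the transactions dataset
serials = pd.Series([43390, 43599, 43605])

# unit="D" treats each serial as a day count from the origin
dates = pd.to_datetime(serials, unit = "D", origin = "1899-12-30")
print(dates.tolist())
```

This avoids the manual ordinal arithmetic and yields the same dates as the `fromordinal` approach above.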
In [9]:
# for ease of code, applying lower case to all column names in both the data sets

purchase_behaviour.columns = [each.lower() for each in purchase_behaviour.columns]
transaction_data.columns = [each.lower() for each in transaction_data.columns]

Overview of Customer Distribution

In [10]:
cust_dist_df = pd.pivot_table(purchase_behaviour, index = ["lifestage", "premium_customer"],
                              values = ["premium_customer"], aggfunc = "size").reset_index()

cust_dist_df.rename({0 : "count", }, axis = 1, inplace = True)

fig = px.bar(data_frame = cust_dist_df,
             y = "lifestage",
             x = "count",
             barmode = "group",
             orientation = "h",
             color = "premium_customer", 
             labels = {"lifestage" : "Customer's Lifestage",
                       "count" : "Total Customers", 
                       "premium_customer" : "Customer Group"},
             title = {"text": "Overview of Total Customers by Group and Lifestage", "font_size" : 20, "x" : 0.5},
             color_discrete_sequence = px.colors.diverging.Portland,
            )

fig.show()

It is evident that Mainstream is the largest customer group by count (this overview covers customers of all products, not just chips).

New Families is the smallest segment.

Identify Outliers in Total Sales and Product Quantity

As these columns represent quantitative data, outliers have been identified for these two columns only.

In [11]:
fig = make_subplots(rows = 2, cols = 1, shared_xaxes = False)

fig.add_trace(go.Box(x = transaction_data["tot_sales"], name = "Total Sales"), row = 1, col = 1)
fig.add_trace(go.Box(x = transaction_data["prod_qty"], name = "Product Quantity"), row = 2, col = 1)

fig.update_layout(width = 900, height = 250, showlegend = False, 
                  title = "Outliers in Total Sales and Product Quantity",
                  font = {"size": 14})
fig.show()

For both product quantity and total sales there is a single extreme outlier. Let's filter the transactions that contain these two values.

In [12]:
# analysis of outliers 

transaction_data[(transaction_data["tot_sales"] > 600) | (transaction_data["prod_qty"] > 150)]
Out[12]:
date store_nbr lylty_card_nbr txn_id prod_nbr prod_name prod_qty tot_sales
69762 2018-08-19 226 226000 226201 4 Dorito Corn Chp Supreme 380g 200 650.000
69763 2019-05-20 226 226000 226210 4 Dorito Corn Chp Supreme 380g 200 650.000
In [13]:
# customer details associated with the outliers

loyalty_number = (transaction_data.loc[(transaction_data["tot_sales"] > 600) | (transaction_data["prod_qty"] > 150),
                                       "lylty_card_nbr"])

purchase_behaviour[purchase_behaviour["lylty_card_nbr"].isin(loyalty_number) == True]
Out[13]:
lylty_card_nbr lifestage premium_customer
59694 226000 OLDER FAMILIES Premium

Observations:

  • the outliers in product quantity and total sales come from the same two transactions
  • only one customer made both purchases
  • since this is a one-off case, these may not be retail-type transactions at all - the outlier can be removed from the transactions data

One interesting thing to note here - there are only 9 transactions in almost a year at store 226!? Is this the case for other stores as well?
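The eyeballed thresholds used above (600 for sales, 150 for quantity) could also be derived from a rule such as Tukey's IQR fence. A minimal sketch on a made-up quantity series, not the project data:

```python
import pandas as pd

# made-up stand-in for a quantity column with one extreme value
qty = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 200])

q1, q3 = qty.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr          # Tukey's upper fence
outliers = qty[qty > upper]
print(outliers.tolist())
```

Anything above the fence would then be reviewed case by case, as done for store 226 above.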

Remove outliers, Analyse correlation between Quantity & Sales

In [14]:
# remove outlier from transaction data 

transaction_data = transaction_data[transaction_data["lylty_card_nbr"].isin(loyalty_number) == False]

fig = make_subplots(rows = 1, cols = 2, shared_yaxes = True, 
                    subplot_titles = ["Distribution", "Correlation"])

fig.add_box(y = transaction_data["prod_qty"], name = "Product Qty", row = 1, col = 1, ) 
fig.add_box(y = transaction_data["tot_sales"], name = "Total Sales", row = 1, col = 1,)


fig.add_scattergl(x = transaction_data["prod_qty"],
                  y = transaction_data["tot_sales"],
                  mode = "markers", name = "", 
                  hoverinfo = "skip",
                  row = 1, col = 2,
                  )

fig.update_layout(title = {"text":"Distribution and Correlation of Total Sales and Product Quantity",
                           "font_size" : 20, "x" : 0.5},
                  showlegend = False,
                 )
fig.show()

The relationship between quantity purchased and sales is heteroscedastic - the spread of sales widens as the quantity increases.

In [15]:
tot_sales_df = (pd.pivot_table(data = transaction_data, index = ["date"], 
                               values = "tot_sales", aggfunc = np.sum).reset_index().round(2))
In [16]:
fig = px.scatter(data_frame = tot_sales_df, 
                 x = "date", y = "tot_sales",
                 color = "tot_sales", size = "tot_sales", size_max = 12,
                 title = {"text": "Total Sales Trend for the Year for All Products", "font_size" : 20, "x" : 0.5},
                 labels = {"date": "Date", "tot_sales": "Total Sales"},
                 color_continuous_scale = px.colors.diverging.Portland, 
                )

fig.update_layout(xaxis_title = "Month-Year")
fig.show()

The highest sales were recorded in Dec 2018, and the lowest in Aug 2018 and May 2019.
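The monthly extremes can be confirmed numerically rather than read off the chart. A sketch on made-up daily totals (`tot_sales_df` would take the place of `daily` in the notebook):

```python
import numpy as np
import pandas as pd

# made-up daily totals standing in for tot_sales_df
daily = pd.DataFrame({"date": pd.date_range("2018-07-01", "2019-06-30", freq = "D")})
rng = np.random.default_rng(1)
daily["tot_sales"] = rng.normal(4700, 300, len(daily))

# collapse days to calendar months and sum
monthly = (daily.assign(month = daily["date"].dt.to_period("M"))
                .groupby("month")["tot_sales"].sum())
print("peak:", monthly.idxmax(), " trough:", monthly.idxmin())
```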

Segregate Product Details

Select only Potato Chip Products

In [17]:
print("Extracted brand names:")

for each_brand in transaction_data["prod_name"].str.split().str[0].value_counts(dropna = False).sort_index().index:
    print(each_brand)
Extracted brand names:
Burger
CCs
Cheetos
Cheezels
Cobs
Dorito
Doritos
French
Grain
GrnWves
Infuzions
Infzns
Kettle
NCC
Natural
Old
Pringles
RRD
Red
Smith
Smiths
Snbts
Sunbites
Thins
Tostitos
Twisties
Tyrrells
WW
Woolworths

Brand names like Smith and Smiths, or GrnWves and Grain Waves, represent the same brand, but the inconsistent spellings make them appear different. Time for some clean-up!

In [18]:
brand_names = {"RRD": "Red Rock Deli", "NCC" : "Natural Chips Co", r"\bDorito\b" : "Doritos", "GrnWves" : "Grain Waves", 
               r"\bSmith\b" : "Smiths", "Snbts" : "Sunbites", "WW" : "Woolworths", "Infzns" : "Infuzions"} 

transaction_data["prod_name"].replace(brand_names, regex = True, inplace = True)

transaction_data["brands"] = (transaction_data["prod_name"].str.replace(r"[^\w\s]+", "", regex = True).
                                str.replace(r"\d+.", " ", regex = True).str.replace(r"\s+", " ", regex = True).
                                str.split().str[0].str.strip())
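The effect of the brand clean-up can be sanity-checked on a few sample strings (illustrative names, not the full product list):

```python
import pandas as pd

names = pd.Series(["Smith Crinkle Cut 170g",
                   "GrnWves Plus Btroot 180g",
                   "Dorito Corn Chp 380g"])

# subset of the mapping used above; \b keeps "Smiths" from becoming "Smithss"
brand_names = {r"\bSmith\b": "Smiths", "GrnWves": "Grain Waves",
               r"\bDorito\b": "Doritos"}

cleaned = names.replace(brand_names, regex = True)
brands = cleaned.str.split().str[0]   # first token taken as the brand
print(brands.tolist())
```

One caveat of the first-token rule: multi-word brands like Grain Waves collapse to "Grain", which is why the extracted list above shows single tokens.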

Filter out products that are Not Potato Chips

Here I have deviated from what was suggested in the sample project. I found two products - Red Rock Deli SR Salsa & Mzzrlla, and Smiths Crinkle Cut Tomato Salsa - which are potato chips (or chip-like) but have a salsa flavour. If we simply remove every product whose name contains "Salsa", we lose these products as well. Although they are just two of the 107 chips products (= 114 - 7 genuine salsa products), I didn't drop them from the transactions data.

So instead of a blanket drop, I listed the genuine Salsa products explicitly and kept only the true non-Salsa products.

In [19]:
transaction_data.loc[transaction_data["prod_name"].str.contains("Salsa") == True, "prod_name"].unique()
Out[19]:
array(['Old El Paso Salsa   Dip Tomato Mild 300g',
       'Red Rock Deli SR    Salsa & Mzzrlla 150g',
       'Smiths Crinkle Cut  Tomato Salsa 150g',
       'Doritos Salsa       Medium 300g',
       'Old El Paso Salsa   Dip Chnky Tom Ht300g',
       'Woolworths Mild     Salsa 300g',
       'Old El Paso Salsa   Dip Tomato Med 300g',
       'Woolworths Medium   Salsa 300g', 'Doritos Salsa Mild  300g'],
      dtype=object)
In [20]:
non_chip_product = ['Old El Paso Salsa Dip Tomato Mild', 
                    'Woolworths Mild Salsa',
                    'Doritos Salsa Medium', 
                    'Old El Paso Salsa Dip Chnky Tom Ht',
                    'Old El Paso Salsa Dip Tomato Med', 
                    'Woolworths Medium Salsa',
                    'Doritos Salsa Mild']

transaction_data = transaction_data[transaction_data["prod_name"].isin(non_chip_product) != True]

Separate Pack size from Product name

In [21]:
transaction_data["pack_size_gms"] = (transaction_data["prod_name"]
                                        .str.extract(pat = r"(\d+)", expand = False).astype(int))
In [22]:
import matplotlib.pyplot as plt
%matplotlib inline

fig, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 4))

transaction_data["pack_size_gms"].hist( ax = ax1)
transaction_data["pack_size_gms"].plot.kde( ax = ax2)

ax1.set_title("Distribution of Pack Sizes", size = 16)
ax1.set_xlabel("Pack Size", size = 12)
ax1.set_ylabel("Total Transactions", size = 12)

ax2.set_title("Kernel Density Estimate of Pack Sizes", size = 16)
ax2.set_xlabel("Pack Size", size = 12)
ax2.set_ylabel("Density", size = 12)

plt.tight_layout()
plt.show()

Combine Customer and Transaction details

In [23]:
chips_retail = pd.merge(right = purchase_behaviour, 
                          left = transaction_data,
                          on = "lylty_card_nbr",
                          how = "left")

# check for any null values 
chips_retail.isnull().sum()

# generate ".csv" file for task 2 of the project
# chips_retail.to_csv("chips_retail.csv", index = False)
Out[23]:
date                0
store_nbr           0
lylty_card_nbr      0
txn_id              0
prod_nbr            0
prod_name           0
prod_qty            0
tot_sales           0
brands              0
pack_size_gms       0
lifestage           0
premium_customer    0
dtype: int64
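The left join above relies on every loyalty number mapping to exactly one behaviour record. That assumption can be made explicit with `validate` and `indicator`; a sketch on toy frames whose column names mirror the notebook's but whose values are made up:

```python
import pandas as pd

tx = pd.DataFrame({"lylty_card_nbr": [1000, 1002, 1002],
                   "tot_sales": [6.0, 2.9, 3.8]})
cust = pd.DataFrame({"lylty_card_nbr": [1000, 1002],
                     "lifestage": ["YOUNG SINGLES/COUPLES", "RETIREES"]})

# validate raises if a card maps to multiple customer rows;
# indicator adds a _merge column flagging unmatched transactions
merged = tx.merge(cust, on = "lylty_card_nbr", how = "left",
                  validate = "many_to_one", indicator = True)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

An empty `unmatched` frame is equivalent to the all-zero `isnull().sum()` check above.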

Analyse the combined data

In [24]:
segment_df = pd.pivot_table(data = chips_retail, 
                            values = ["tot_sales", "prod_qty"],
                            index = ["lifestage", "premium_customer"], 
                            aggfunc = {"tot_sales": [np.sum, np.mean],
                                       "prod_qty" : [np.sum, np.mean]}
                           )

segment_df.columns = segment_df.columns.map("_".join)
segment_df.reset_index(inplace = True)
segment_df = segment_df.round(5)
In [25]:
fig = make_subplots(rows=1, cols=3, specs=[[{"type": "domain"}, {"type": "domain"},{"type": "domain"}]])

segments = ["Budget", "Premium", "Mainstream"]

for each_group in segments:
    
    # add pie trace to each plot
    fig.add_trace(go.Pie(labels = segment_df.loc[segment_df["premium_customer"] == each_group, "lifestage"],
                         values = segment_df.loc[segment_df["premium_customer"] == each_group, "tot_sales_sum"], 
                         name = each_group,
                        ),
                  row = 1, col = segments.index(each_group) + 1)

fig.update_traces(hole = 0.6,
                  hoverinfo="label+value+name",
                  sort = False,
                  marker_colors = px.colors.diverging.Portland,
                  scalegroup = "one",
                  hovertext = "label+value+name")

fig.update_layout(legend = {"orientation": "h", "yanchor": "bottom", "y": -0.25, "xanchor": "center", "x": 0.5},
                  title = {"text": "Total Sales by Customer Group and Customer's Lifestage",
                           "font_size": 20, "x" : 0.5},
                  annotations = [{"text" : "Budget", "x" : 0.1, "y" : 0.5, "font_size": 15, "showarrow": False},
                                 {"text" : "Premium", "x" : 0.5, "y" : 0.5, "font_size": 15, "showarrow": False},
                                 {"text" : "Mainstream", "x" : 0.91, "y" : 0.5, "font_size": 15, "showarrow": False}],
                  height = 440)

fig.show()

Observations:

  • Even after removing the Salsa products, customer purchase behaviour for chips resembles that for all products, i.e. Mainstream customers contribute the maximum sales.
  • The major contributors to chips sales, within their respective customer groups, are:
    • Budget - Older Families
    • Mainstream - Young Singles/Couples
    • Mainstream - Retirees
  • For the Older Singles/Couples lifestage, the total sales figures are very close across all three customer groups.
In [26]:
px.scatter(data_frame = segment_df,
           y = "lifestage",
           x = "tot_sales_mean",
           symbol = "premium_customer", color = "premium_customer",
           size = "tot_sales_mean", size_max = 12,
           labels = {"lifestage": "Customer's Life Stage", 
                     "premium_customer" : "Customer Class",
                     "tot_sales_mean" : "Average Sales per unit for the Segment"},
           height = 350,
           title = {"text": "Average Sales per unit for Customer Class and Life Stage Segment",
                    "font_size" : 20},
           color_discrete_sequence = px.colors.diverging.Portland
          )  
  • The average sale per transaction for each customer segment lies between 6.6 and 7.7.
  • It does not vary much between lifestages, except for the Midage and Young Singles/Couples groups.
In [27]:
px.scatter(data_frame = segment_df,
           y = "lifestage",
           x = "prod_qty_mean",
           symbol = "premium_customer", color = "premium_customer",
           size = "prod_qty_mean", size_max = 12,
           labels = {"lifestage": "Customer's Life Stage", 
                     "premium_customer" : "Customer Group",
                     "prod_qty_mean" : "Average Quantity for the Segment"},
           height = 350,
           title = {"text": "Average Quantity Purchased by Customer Group and Life Stage Segment",
                    "font_size": 20},
           color_discrete_sequence = px.colors.diverging.Portland
          )  

Similar to the average sales, the average quantity purchased per transaction shows little variation across the segments.

How different are Avg sales for Mainstream Midage & Young Singles/Couples?

This t-test will help us understand whether the average spend of Mainstream Midage & Young Singles/Couples customers is significantly different from that of Premium and Budget customers in the same lifestages.

Hypothesis for the test:

$H_0 : \bar{x}_M = \bar{x}_{(B \space or \space P)}$

$H_{\alpha} : \bar{x}_M \neq \bar{x}_{(B \space or \space P)}$

At significance or $\alpha = 5\%$

I was not sure whether Budget and Premium should be taken together or separately for this test, so I ran it both ways - individually and combined.
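Under the hood, `equal_var = False` selects Welch's t-test, whose statistic uses unpooled variances. A minimal sketch on synthetic samples (the means and sizes here are made up, not the project's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.normal(7.6, 3.0, 400)   # stand-in for Mainstream sales
b = rng.normal(7.0, 3.0, 300)   # stand-in for Budget/Premium sales

# Welch's t-statistic: unpooled sample variances (ddof = 1)
t_manual = (a.mean() - b.mean()) / np.sqrt(a.var(ddof = 1) / len(a)
                                           + b.var(ddof = 1) / len(b))

t_scipy, p_value = stats.ttest_ind(a, b, equal_var = False)
print(f"manual t = {t_manual:.3f}, scipy t = {t_scipy:.3f}")
```

The two statistics match, which is the sanity check that `equal_var = False` implements exactly this formula.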

In [28]:
mainstream = chips_retail.loc[(chips_retail["premium_customer"] == "Mainstream") &
                              (chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES", 
                                                               "YOUNG SINGLES/COUPLES"])), "tot_sales"]


budget = chips_retail.loc[(chips_retail["premium_customer"] == "Budget") &
                          (chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES",
                                                           "YOUNG SINGLES/COUPLES"])), "tot_sales"]

premium = chips_retail.loc[(chips_retail["premium_customer"] == "Premium") &
                           (chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES", 
                                                            "YOUNG SINGLES/COUPLES"])), "tot_sales"]

var = [np.var(each) for each in [mainstream, budget, premium]]
 
t_stats, p_value = stats.ttest_ind(a = mainstream, b = budget, equal_var = False)
print("For Mainstream vs Budget test statistics and the p-value are {:.3f} and {:.5f}".format(t_stats, p_value))

t_stats, p_value = stats.ttest_ind(a = mainstream, b = premium, equal_var = False)
print("For Mainstream vs Premium test statistics and the p-value are {:.3f} and {:.5f}".format(t_stats, p_value))


# budget and premium added together
bgt_prm = chips_retail.loc[(chips_retail["premium_customer"] != "Mainstream") &
                           (chips_retail["lifestage"].isin(["MIDAGE SINGLES/COUPLES", 
                                                            "YOUNG SINGLES/COUPLES"])), "tot_sales"]

np.var(bgt_prm)

t_stats, p_value = stats.ttest_ind(a = mainstream, b = bgt_prm, equal_var = False)
print("For Mainstream vs Others test statistics and the p-value are {:.3f} and {:.5f}".format(t_stats, p_value))
For Mainstream vs Budget test statistics and the p-value are 31.137 and 0.00000
For Mainstream vs Premium test statistics and the p-value are 26.562 and 0.00000
For Mainstream vs Others test statistics and the p-value are 36.045 and 0.00000

Based on the t-test results we can conclude that, at 5% significance, the average sales for the Mainstream - midage & young singles/couples segment differ statistically from those of

  • Budget - midage & young singles/ couples
  • Premium - midage & young singles/ couples
  • Budget & Premium - midage & young singles/ couples - combined

Brand Affinity Analysis

(Affinity analysis is a fairly new subject for me.)

The chips brand most preferred by the Mainstream - Young Singles/Couples segment, relative to all other customers, is Tyrrells, with Twisties and Kettle the next top contenders.

Both products under the Tyrrells brand come in 165 g packs:

  • Tyrrells Crisps Lightly Salted 165g &
  • Tyrrells Crisps Ched & Chives 165g
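The affinity ratio is simply the segment's share of quantity for a brand divided by every other customer's share for that brand. A toy example with made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({"segment": ["target", "target", "others", "others", "others"],
                    "brand":   ["A", "B", "A", "B", "B"],
                    "qty":     [3, 1, 2, 2, 4]})

grp = toy[toy["segment"] == "target"]
rest = toy[toy["segment"] == "others"]

# each group's quantity share per brand
target_share = grp.groupby("brand")["qty"].sum() / grp["qty"].sum()
others_share = rest.groupby("brand")["qty"].sum() / rest["qty"].sum()

# ratio > 1 means the brand over-indexes in the target segment
affinity = target_share / others_share
print(affinity.to_dict())
```

Here brand A takes 75% of the target segment's quantity but only 25% of everyone else's, giving an affinity of 3.0; the notebook applies the same ratio to the Mainstream - Young Singles/Couples segment.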
In [29]:
main_segment_qty_sum = chips_retail.loc[(chips_retail["premium_customer"] == "Mainstream") & 
                                        (chips_retail["lifestage"] == "young singles/couples".upper()), "prod_qty"].sum()

others_qty_sum = chips_retail.loc[(chips_retail["premium_customer"] != "Mainstream") & 
                                  (chips_retail["lifestage"] != "young singles/couples".upper()), "prod_qty"].sum()

main_segment = chips_retail.loc[(chips_retail["premium_customer"] == "Mainstream") & 
                                (chips_retail["lifestage"] == "young singles/couples".upper())]

others = chips_retail.loc[(chips_retail["premium_customer"] != "Mainstream") & 
                          (chips_retail["lifestage"] != "young singles/couples".upper())]

main_segment_ratio = pd.Series(main_segment.groupby("brands")["prod_qty"].sum()/main_segment_qty_sum)

others_ratio = pd.Series(others.groupby("brands")["prod_qty"].sum()/others_qty_sum)


brand_affinity = pd.concat([main_segment_ratio, others_ratio, 
                            pd.Series(main_segment_ratio / others_ratio)], axis = 1).reset_index()

brand_affinity.columns = ["brands", "mainstream_young", "others", "brand_affinity"]

brand_affinity.sort_values("brand_affinity", ascending = False)
Out[29]:
brands mainstream_young others brand_affinity
19 Tyrrells 0.030 0.024 1.234
18 Twisties 0.043 0.035 1.225
9 Kettle 0.186 0.155 1.196
17 Tostitos 0.043 0.036 1.191
11 Old 0.042 0.035 1.191
12 Pringles 0.112 0.094 1.188
4 Cobs 0.042 0.036 1.168
5 Doritos 0.123 0.105 1.167
8 Infuzions 0.061 0.054 1.133
16 Thins 0.057 0.053 1.063
7 Grain 0.031 0.029 1.059
3 Cheezels 0.017 0.018 0.956
14 Smiths 0.093 0.121 0.770
6 French 0.004 0.005 0.696
2 Cheetos 0.008 0.011 0.687
13 Red 0.045 0.068 0.664
10 Natural 0.018 0.029 0.637
1 CCs 0.010 0.017 0.610
15 Sunbites 0.006 0.012 0.506
20 Woolworths 0.028 0.056 0.501
0 Burger 0.003 0.006 0.450

Chips from Tyrrells Brand

In [30]:
print(chips_retail.loc[chips_retail["brands"] == "Tyrrells", "prod_name"].unique())
['Tyrrells Crisps     Lightly Salted 165g'
 'Tyrrells Crisps     Ched & Chives 165g']
In [31]:
fig = px.bar_polar(data_frame = brand_affinity.round(3),
                   theta = "brands",
                   r = "brand_affinity",
                   color = "brand_affinity",                 
                   color_continuous_scale = px.colors.diverging.Portland,
                   title = {"text": "Brand Affinity for Mainstream - Young Singles/Couples Segment",
                            "font_size" : 20, "x" : 0.5},
                   labels = {"brand_affinity": "Affinity"},
                   height = 600
                   )

fig.show()

Pack Size Affinity Analysis

The pack size most preferred by the Mainstream - Young Singles/Couples segment, relative to all other customers, is 270 g, followed by the 380 g and 330 g packs.

The two products available in the 270 g pack are:

  • Twisties Cheese &
  • Twisties Chicken

both from Twisties, again the segment's second most preferred brand.

In [32]:
main_segment_ratio = pd.Series(main_segment.groupby("pack_size_gms")["prod_qty"].sum()/main_segment_qty_sum)

others_ratio = pd.Series(others.groupby("pack_size_gms")["prod_qty"].sum()/others_qty_sum)

pack_affinity = pd.concat([main_segment_ratio, others_ratio, 
                            pd.Series(main_segment_ratio / others_ratio)], axis = 1).reset_index()

pack_affinity.columns = ["pack_size_gms", "mainstream_young", "others", "pack_affinity"]

pack_affinity["pack_size_gms"] = pack_affinity["pack_size_gms"].astype(str).apply(lambda s: s + "g")

pack_affinity.sort_values("pack_affinity", ascending = False)
Out[32]:
pack_size_gms mainstream_young others pack_affinity
17 270g 0.030 0.023 1.277
20 380g 0.030 0.024 1.258
19 330g 0.057 0.048 1.210
2 110g 0.100 0.083 1.194
4 134g 0.112 0.094 1.188
14 210g 0.027 0.023 1.177
5 135g 0.014 0.012 1.149
16 250g 0.013 0.012 1.123
9 170g 0.076 0.075 1.011
18 300g 0.055 0.057 0.969
6 150g 0.155 0.163 0.950
10 175g 0.239 0.253 0.945
8 165g 0.052 0.057 0.909
12 190g 0.007 0.011 0.620
11 180g 0.003 0.006 0.584
7 160g 0.006 0.011 0.527
3 125g 0.003 0.006 0.507
1 90g 0.006 0.012 0.506
13 200g 0.008 0.017 0.489
0 70g 0.003 0.006 0.486
15 220g 0.003 0.006 0.450

Chips Products in pack size 270 gms

In [33]:
print([each for each in chips_retail.loc[chips_retail["pack_size_gms"] == 270, "prod_name"].unique()])
['Twisties Cheese     270g', 'Twisties Chicken270g']
In [34]:
fig = px.bar_polar(data_frame = pack_affinity.round(3),
                   theta = "pack_size_gms",
                   r = "pack_affinity",
                   color = "pack_affinity",                 
                   color_continuous_scale = px.colors.diverging.Portland_r,
                   title = {"text": "Pack Size Affinity for Mainstream - Young Singles/Couples Segment",
                            "font_size" : 20, "x" : 0.5},
                   labels = {"pack_affinity": "Affinity"},
                   height = 600
                  )

fig.show()

Additional Visualizations

All Year Total Sales for Chips Products Only

In [35]:
df = chips_retail.groupby("date", as_index = False)["tot_sales"].sum()

df = pd.concat([df, pd.DataFrame([{"date" : "2018-12-25", "tot_sales" : np.nan}])], ignore_index = True)

df["date"] = pd.to_datetime(df["date"]).dt.strftime("%Y-%m-%d")
df = df.sort_values("date")

fig = go.Figure()

fig.add_trace(go.Scatter(x = df["date"],
                         y = df["tot_sales"],
                         mode = "lines", line = {"color" : px.colors.diverging.Portland[0]}))

fig.update_layout(title = {"text" : "All Year Total Sales for Chips", "font_size" : 20, "x" : 0.5},
                  height = 450, 
                  xaxis = {"title" : "Date-Month-Year"},
                  yaxis = {"title" : "Total Sales for the Day", "range" : [3700, 7100]}, 
                  font_size = 13)

fig.show()

Month-wise total sales

To further analyse the months of Aug 2018, Dec 2018 and May 2019.

In [36]:
df = chips_retail.groupby(["date", "premium_customer"])["tot_sales"].sum().reset_index().round(2)

y_limits =  [min(df["tot_sales"]) - 100, max (df["tot_sales"]) + 100]

from datetime import datetime as dt

# df["month"] = df["date"].dt.strftime("%B")

fig = px.scatter(data_frame = df,
                 x = df["date"].dt.strftime("%d"),
                 y = "tot_sales",
                 color = "premium_customer", symbol = "premium_customer",
                 size = "tot_sales", size_max = 12, 
                 animation_frame = df["date"].dt.strftime("%B-%Y"),
                 height = 400,
                 range_y = y_limits,
                 color_discrete_sequence = px.colors.diverging.Portland,
                 labels = {"x": "Day", "tot_sales" : "Total Sales", 
                           "premium_customer" : "Customer Group", "animation_frame" : "Month-Year"}
                )
fig["layout"].pop("updatemenus")
fig.update_layout(title = {"text" : "Month-wise Sales", "font_size" : 20, "x": 0.5}, font_size = 14)

fig.show()
  • 25th Dec, being Christmas, has no sales.
  • Aug 2018 and May 2019 show a dip in sales between the 15th and 20th of the month.
  • For a single day, 24th Dec records the maximum total sales for the Mainstream group.

Summary & Conclusion

Out of the three customer groups - Premium, Budget and Mainstream - Mainstream contributes the most to potato chips sales. By lifestage, Budget - Older Families, Mainstream - Young Singles/Couples and Mainstream - Retirees contribute the highest sales among all segments.

Mainstream - Midage & Young Singles/Couples buyers show roughly 23% higher affinity towards the Tyrrells brand than other customers do. The average spend of Mainstream - Midage & Young Singles/Couples customers is also higher than that of the Budget and Premium customers in the same lifestages.

We can conclude that the chips category should promote the Tyrrells and Twisties brands to the Mainstream - Young Singles/Couples segment to help drive sales further.